Skip to content

fix(sandbox): rewrite messaging credential placeholders#1286

Draft
ericksoa wants to merge 17 commits intoNVIDIA:mainfrom
ericksoa:fix/872-websocket-credential-rewrite
Draft

fix(sandbox): rewrite messaging credential placeholders#1286
ericksoa wants to merge 17 commits intoNVIDIA:mainfrom
ericksoa:fix/872-websocket-credential-rewrite

Conversation

@ericksoa
Copy link
Copy Markdown

@ericksoa ericksoa commented May 9, 2026

Summary

Fixes #872 by making OpenShell the relay-boundary credential substitution point for messaging APIs. The PR keeps the opt-in WebSocket text-frame rewrite path and adds REST request-body credential rewrite for inspected REST endpoints.

Provider-managed credentials stay out of sandbox env values, argv, config files, logs, and sandbox-authored placeholder strings. OpenShell resolves them only at explicitly configured inspected egress paths, then fails closed if a reserved placeholder or provider-shaped alias remains unresolved.

Related Issue

Fixes #872

Reviewer Map

  • Policy/schema/CLI/docs surface: proto/sandbox.proto, crates/openshell-policy, crates/openshell-cli, crates/openshell-providers, crates/openshell-server, docs/reference/policy-schema.mdx, docs/sandboxes/policies.mdx, and docs/security/best-practices.mdx.
  • Credential placeholder and alias resolution: crates/openshell-sandbox/src/secrets.rs, proxy.rs, l7/rest.rs, l7/relay.rs, and l7/websocket.rs.
  • WebSocket protocol enforcement and GraphQL-over-WebSocket policy: crates/openshell-sandbox/src/l7/websocket.rs, l7/graphql.rs, l7/mod.rs, l7/relay.rs, l7/rest.rs, opa.rs, and sandbox-policy.rego.
  • Regression and conformance coverage: crates/openshell-sandbox tests, e2e/rust/tests/websocket_conformance.rs, tasks/test.toml, and .github/workflows/websocket-conformance.yml.

Changes

  • Preserve the canonical openshell:resolve:env:KEY placeholder contract and revisioned forms.
  • Add provider-neutral whole-token aliases containing OPENSHELL-RESOLVE-ENV-KEY, such as provider-OPENSHELL-RESOLVE-ENV-API_TOKEN, for provider/env resolution.
  • Treat provider-shaped tokens as generic aliases for the canonical provider/env resolver; the final token format comes from the resolved secret value, not from provider-specific resolver branches.
  • Resolve canonical placeholders and aliases in request targets, query values, headers, supported REST request bodies, and opted-in WebSocket client text frames.
  • Support percent-encoded canonical placeholders in request targets and query values, so callers do not need a decode proxy for URL-encoded placeholders.
  • Add endpoint policy field request_body_credential_rewrite, defaulting to false.
  • Add incremental CLI endpoint option request-body-credential-rewrite, accepted only for protocol: rest.
  • Wire the new field through proto conversion, YAML serde, provider profiles, merge behavior, OPA JSON conversion, L7 config parsing, CLI summaries, sandbox-local policy round trips, and docs.
  • Buffer up to 256 KiB for UTF-8 application/json, application/x-www-form-urlencoded, and text/* REST request bodies when body rewrite is enabled.
  • Recompute Content-Length after REST body rewrite.
  • Preserve unsupported and binary request bodies when they do not contain reserved markers, and reject unsupported body framing or content types when a reserved marker is present.
  • Keep the existing WebSocket work in this PR: strict RFC 6455 frame validation, handshake response validation, safe RFC 7692 permessage-deflate negotiation, raw opt-out behavior, GraphQL-over-WebSocket operation policy, and no-payload/no-secret logging.

Security and Spec Notes

  • Credential rewrite is opt-in per endpoint and tied to inspected relay paths.
  • Unresolved canonical placeholders or OPENSHELL-RESOLVE-ENV aliases fail closed on paths that inspect credentials.
  • The relay does not log resolved secrets, unresolved placeholders, payloads, or secret lengths.
  • Binary WebSocket frames are not rewritten.
  • Selective credential passthrough is not implemented.
  • Generic WebSocket payload-content policy matching is not implemented; GraphQL-over-WebSocket uses structured GraphQL operation semantics.
  • Unsafe compression variants are not supported.

Testing

Focused validation run on pushed head e59ab4dbcf7f7753b25850d66a23127d107cdc85:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-policy
  • cargo test -p openshell-sandbox credential
  • cargo test -p openshell-sandbox l7
  • cargo test -p openshell-server summarize_cli_policy_merge_op
  • cargo build -p openshell-cli --bin openshell
  • cargo clippy -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run docs
  • mise run python:proto
  • mise run pre-commit

Autobahn was not run because the repo does not currently include an Autobahn harness. This PR adds focused in-tree WebSocket protocol and compression regression coverage plus a dedicated Docker-backed WebSocket conformance e2e lane; the workflow is manual today with a maintainer decision below on scheduling.

Remaining Maintainer Decisions

  • Whether to turn .github/workflows/websocket-conformance.yml into scheduled nightly coverage after manual burn-in.
  • Whether to add first-class CLI sugar for GraphQL policy authoring later. If added, it should cover GraphQL-over-HTTP and GraphQL-over-WebSocket together.
  • Whether OpenShell should introduce a dedicated realtime/subscription access preset later. This PR keeps subscriptions explicit unless access: full is used.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Docs updated for user-facing policy and CLI changes

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

ericksoa added 4 commits May 8, 2026 19:58
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

WebSocket implementation alignment note

This PR follows common hardening patterns used by established WebSocket clients, servers, and proxy implementations, while keeping the implementation tailored to OpenShell's credential-boundary model.

The important alignment points are behavioral:

  • validate the HTTP upgrade response before treating the connection as a WebSocket;
  • track the original client handshake offer and reject upstream responses that select options the relay did not safely offer;
  • parse frames strictly, including masking, reserved bits, opcodes, fragmentation, UTF-8 text, close frames, and bounded message sizes;
  • keep compression support intentionally narrow, deterministic, and fail-closed;
  • avoid payload logging and expose only safe metadata for observability;
  • preserve raw pass-through behavior when an endpoint has not opted into credential rewrite or semantic message policy;
  • use structured GraphQL operation policy for GraphQL-over-WebSocket instead of generic payload-content matching.

This matters because OpenShell is not trying to become a general WebSocket application framework. The purpose here is narrower: preserve the no-secret-in-sandbox invariant, then resolve placeholders only at the OpenShell relay boundary for opted-in client-to-server text frames. The safest shape is therefore a small, explicit protocol surface: if the relay can validate and process the negotiated WebSocket state, it proceeds; if not, it closes before forwarding an unsafe upgraded flow.

The remaining confidence-building work is conformance and durability, not feature expansion. The follow-up path is to keep the in-tree RFC regression matrix, maintain the Docker-backed WebSocket conformance e2e lane, and consider an Autobahn-style relay harness or parser fuzzing after this PR lands.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Follow-up hardening pass

This pass applies common WebSocket proxy hardening patterns to the OpenShell relay: parse negotiation as structured protocol data, validate server-selected options against the original client offer, canonicalize only the supported safe extension forms, and fail closed when the rewrite path sees parameters it does not explicitly support.

Concretely, fix(sandbox): harden websocket negotiation parsing adds RFC 6455 subprotocol tracking and response validation, replaces ad hoc Sec-WebSocket-Extensions splitting with token-aware parsing, canonicalizes only the safe permessage-deflate no-context-takeover subset, rejects duplicate/value-bearing/quoted unsafe extension parameters in the rewrite path, and keeps WebSocketExtensionMode::Preserve raw so opt-out behavior does not change.

This matters for credential rewriting because the proxy becomes a protocol participant after it mutates an upgraded flow. Loose negotiation can let the upstream select compression state or subprotocol behavior the relay did not offer or understand, which creates room for state confusion, fail-open compression handling, or secret-bearing payloads moving through a path that no longer has the invariants the rewriter depends on. The new behavior keeps the supported surface narrow and deterministic: either the relay can validate and safely process the negotiated shape, or the handshake fails before forwarding the 101 to the client.

Remaining conformance work is still useful, especially an Autobahn-style relay harness and broader fuzz/property coverage for handshake parsing and frame sequencing. Those should come after this focused spec pass so future parser or engine swaps have a stable regression matrix to preserve.

@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

I have read the DCO document and I hereby sign the DCO.

@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

I have read the Contributor Agreement including DCO and I hereby sign the Contributor Agreement and DCO

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Focused conformance matrix added

I added test(sandbox): add websocket conformance relay matrix as the practical next step before a full Autobahn lane. The new in-crate harness drives the real HTTP 101 relay path and then sends post-upgrade frames through three representative modes:

  • raw preserve mode, proving opt-out behavior still relays frames without mask validation or rewriting;
  • rewrite mode without negotiated compression, proving a masked text credential is rewritten only after a validated upgrade;
  • rewrite mode with the safe permessage-deflate subset, proving canonical extension negotiation carries through to compressed text rewriting with RSV1 preserved.

This is not a replacement for Autobahn. It is the closest low-cost regression layer for this PR because it covers the specific OpenShell behavior boundary: once the proxy becomes an active WebSocket participant for credential rewriting, the handshake assumptions, compression negotiation, and post-upgrade frame relay need to be tested together rather than only as isolated parser/frame unit tests. Full Autobahn remains useful as a follow-up CI or manual conformance job, but it would add Docker/runtime/tooling weight beyond this focused PR pass.

ericksoa added 2 commits May 8, 2026 22:04
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Progress update: policy configuration now treats WebSocket as a first-class inspected protocol in the incremental update flow.

What changed:

  • openshell policy update --add-endpoint host:port:read-write:websocket:enforce is now accepted, so users do not have to hand-edit YAML just to create a WebSocket L7 endpoint.
  • --add-allow and --add-deny can target existing protocol: websocket endpoints using the existing method/path schema. For WebSocket, GET means the RFC 6455 upgrade and WEBSOCKET_TEXT means client text messages on the upgraded request path; this deliberately does not add payload-content matching.
  • Policy merge access expansion is now protocol-aware: WebSocket read-only expands to GET, WebSocket read-write expands to GET plus WEBSOCKET_TEXT, and WebSocket presets no longer expand into REST write methods.
  • CLI help and policy/security docs now document the WebSocket workflow, the WEBSOCKET_TEXT method, and the current rule-level constraints.

Why it matters:
This makes the WebSocket credential-rewrite capability operationally usable through the same policy-update workflow as HTTP. A maintainer can start with a minimal endpoint, audit denials, then add specific WebSocket handshake/text-frame rules without replacing the full policy. It also keeps the schema natural: transport selection stays on protocol, permissions stay in method/path rules, and WebSocket text-frame policy remains explicitly separate from message payload inspection.

Validation:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-policy merge
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-cli -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

ericksoa added 2 commits May 8, 2026 22:23
…dential-rewrite

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Progress update: WebSocket credential replacement is now configurable from the incremental policy workflow and covered through the route-selected relay path.

What changed:

  • Added openshell policy update --websocket-credential-rewrite for --add-endpoint operations. This lets users create or merge a protocol: websocket endpoint and enable text-frame credential placeholder replacement without hand-editing YAML.
  • The flag is intentionally constrained: it is accepted only for added protocol: websocket endpoints or protocol: rest compatibility endpoints that may upgrade to WebSocket. It is rejected for plain L4 endpoints and protocol: sql.
  • Merge behavior now preserves/merges websocket_credential_rewrite: true when an incoming endpoint is merged into an existing endpoint.
  • Gateway/server-side merge summaries now surface websocket_credential_rewrite=true so audit/progress output records when this sensitive behavior is enabled, without logging secret material.
  • Added a route-selected WebSocket regression proving the main value path: sandbox sends an openshell:resolve:env:* placeholder in a masked client text frame, OpenShell resolves it after a valid 101 upgrade, upstream receives the real credential, and the forwarded client-to-server frame remains masked.
  • Updated the policy docs and CLI examples to show the WebSocket credential rewrite workflow.

Why it matters:
The point of this capability is not only to parse WebSockets; it is to let agents call real WebSocket services while keeping service credentials out of sandbox-authored code and logs. This makes that operational path first-class: policy can allow the upgrade and client text messages, then explicitly opt in to resolving OpenShell placeholders at the relay boundary. The agent still sends placeholders, OpenShell injects the credential only at egress time, binary frames remain untouched, and no payload-content policy matching was added.

Validation:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-policy merge
  • cargo test -p openshell-server summarize_cli_policy_merge_op_formats_websocket_credential_rewrite
  • cargo test -p openshell-sandbox route_selected_websocket_rewrites_text_credentials_after_upgrade
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-cli -p openshell-sandbox -p openshell-policy -p openshell-server --all-targets -- -D warnings
  • mise run pre-commit

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Follow-up update after the policy CLI shape review:

  • Kept the WebSocket conformance workflow manual-only. The workflow remains workflow_dispatch without a schedule.
  • Changed incremental policy update syntax so WebSocket credential rewrite follows the existing endpoint-local pattern. The new CLI form is:
openshell policy update demo \
  --add-endpoint realtime.example.com:443:read-write:websocket:enforce:websocket-credential-rewrite \
  --binary /usr/bin/node

Rationale: --binary is command-wide because binaries live on the network rule and already apply to each endpoint added in the same command. protocol, access, enforcement, and websocket_credential_rewrite are endpoint fields, so rewrite should be expressed on the --add-endpoint spec instead of through a broad flag that accidentally applies to every endpoint in the batch. This keeps the WebSocket extension aligned with the existing REST/WebSocket endpoint schema and makes mixed batches safer to review.

Message-content policy would buy a different class of control than this PR currently implements: it would let OpenShell authorize WebSocket messages by application intent inside the text payload, such as event type, channel, action, model/tool name, or service-specific JSON shape, instead of only by host/port, upgrade path, and WEBSOCKET_TEXT. That would be valuable for production SaaS/realtime APIs where the same WebSocket path carries both safe reads and state-changing operations. I am leaving that out of this PR because it is a larger product/schema decision: WebSocket payloads are not standardized, parsers must be protocol-specific and opt-in, and any denial/logging path must preserve the current no-payload/no-secret/no-length-leak guarantees.

Validation after this update:

  • cargo fmt
  • cargo test -p openshell-cli policy_update
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-cli -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

All passed locally.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Copy link
Copy Markdown
Author

ericksoa commented May 9, 2026

Progress update: GraphQL-over-WebSocket semantic policy pass

Implemented in 55f52e73b0194aa68330c224429efb1f919bc142:

  • Added GraphQL-over-WebSocket operation inspection for protocol: websocket endpoints that contain GraphQL operation policy.
  • Reused the existing GraphQL classifier and OPA operation rules for client operation messages instead of adding generic JSON payload matching.
  • Supports modern graphql-transport-ws client subscribe messages and legacy graphql-ws / subscriptions-transport-ws start messages.
  • Treats connection_init, connection_terminate, ping, pong, complete, and stop as GraphQL WebSocket control-plane messages, without payload logging.
  • Keeps raw WebSocket behavior unchanged for endpoints without GraphQL operation policy.
  • Keeps credential rewrite before message classification, including connection_init auth payloads, so placeholder credentials can be resolved before upstream forwarding.
  • Added fail-closed handling for malformed JSON, missing operation ids/payloads, unsupported client message types, GraphQL parse errors, unregistered hash-only persisted queries, and unallowed operations.
  • Made GraphQL policy authoritative for WebSocket operation messages so a generic WEBSOCKET_TEXT rule cannot bypass operation_type / operation_name / fields constraints.
  • Added validation errors for mixed WebSocket transport + GraphQL operation rule fields and for raw WEBSOCKET_TEXT message rules on WebSocket endpoints that use GraphQL operation policy.
  • Added OCSF/log summary assertions so GraphQL-over-WebSocket logging includes only operation metadata and does not include payloads, variables, placeholders, secrets, or secret lengths.
  • Updated policy docs and security guidance for GraphQL-over-WebSocket policy shape.
  • Commit includes DCO signoff.

Validation run:

  • cargo fmt
  • cargo test -p openshell-sandbox websocket
  • cargo test -p openshell-sandbox graphql
  • cargo test -p openshell-sandbox secrets
  • cargo test -p openshell-policy
  • cargo clippy -p openshell-sandbox -p openshell-policy --all-targets -- -D warnings
  • mise run pre-commit

Maintainer decisions still explicit:

  • I kept this as semantic GraphQL-over-WebSocket policy, not generic message-content policy.
  • I did not add selective credential passthrough.
  • I did not rewrite binary frames.
  • I did not add unsafe compression variants.
  • I did not mark the draft PR ready for review.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa changed the title fix(sandbox): rewrite credential placeholders in websocket text frames fix(sandbox): rewrite messaging credential placeholders May 11, 2026
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Secrets rewriting is not applied to websocket traffic.

1 participant